This project was done as a part of “Data Analysis” course as part of our teams studies in Digital Sciences for High-Tech in the University of Tel-Aviv. Our team has great interest in using our studies for exploring and in the future maybe even developing tools for improving the way we treat our planet. Therefore our subject is CO2 emissions in congestion with the world happiness index.
happy_index_2005 <- "Data/world-happiness-report-2005-2020.csv"
happy_index_2021 <- "Data/world-happiness-report-2021.csv"
emission <- "Data/co2_emission.csv"
population <-"Data/population.csv"
df_2005 <- read.csv(happy_index_2005)
df_2021 <- read.csv(happy_index_2021)
df_emiss <- read.csv(emission, col.names = c("Entity", "Code" , "Year", "co2"))
df_pop <- read.csv(population, col.names = c("Entity", "Code" , "Year", "Population"))
The first Data consists of the UN happiness index for the years 2005-2017.
as_tibble(df_2005)
The second data is the happiness index report for 2021.
as_tibble(df_2021)
The third data contains CO2 emissions from around 1700 up until 2017, by countries and continents.
as_tibble(df_emiss)
The final data is a list of population sizes by countries for about the same years as the CO2 data.
as_tibble(df_pop)
As seen in the table above, the main dataset regarding emissions has only 4 features, of which 2 are identical (countries and country code). We will now examine the summary of it’s characteristics:
summary(df_emiss)
Entity Code Year co2
Length:20853 Length:20853 Min. :1751 Min. :-6.255e+08
Class :character Class :character 1st Qu.:1932 1st Qu.: 3.188e+05
Mode :character Mode :character Median :1971 Median : 3.829e+06
Mean :1953 Mean : 1.931e+08
3rd Qu.:1995 3rd Qu.: 3.707e+07
Max. :2017 Max. : 3.615e+10
As seen above, we can see the minimal year, minimal amount of CO2 emissions and same for mean and max values. We also learn about the size of the data, around ~21k rows.
[1] "ï..Country.name" "year" "Life.Ladder" "Log.GDP.per.capita"
[5] "Social.support" "Healthy.life.expectancy.at.birth" "Freedom.to.make.life.choices" "Generosity"
[9] "Perceptions.of.corruption" "Positive.affect" "Negative.affect"
Joining, by = "Entity"
We have just created the two main datasets to work with, which includes important features from all of the previous datasets and some features that we have created above:
We also made a choice to limit our data regarding emissions to after 1950. The reason is that the rise in values is almost exponential after the years of WW2. When plotted on a graph, placing values from before and after 1950 they become incomparable.
The summary for df_merge:
summary(df_merge)
Entity avg.ratio avg.Population avg.co2 Life.Ladder sum.dif
Length:138 Min. : 0.03123 Min. :2.900e+05 Min. :2.539e+05 Min. :2.662 Min. :-675355128
Class :character 1st Qu.: 0.67361 1st Qu.:4.486e+06 1st Qu.:4.375e+06 1st Qu.:4.593 1st Qu.: 142380
Mode :character Median : 3.04396 Median :9.762e+06 Median :2.017e+07 Median :5.587 Median : 6399728
Mean : 4.87430 Mean :4.360e+07 Mean :1.902e+08 Mean :5.463 Mean : 102976353
3rd Qu.: 7.70521 3rd Qu.:2.873e+07 3rd Qu.:9.600e+07 3rd Qu.:6.262 3rd Qu.: 28576960
Max. :25.67181 Max. :1.296e+09 Max. :5.558e+09 Max. :7.788 Max. :7786512016
And for df_all:
summary(df_all)
Entity Code Year co2 Diff Population ratio
Length:17854 Length:17854 Min. :1950 Min. :-6.255e+08 Min. :-600971748 Min. :1.000e+03 Min. : 0.000
Class :character Class :character 1st Qu.:1967 1st Qu.: 4.746e+05 1st Qu.: -9810 1st Qu.:2.400e+05 1st Qu.: 0.408
Mode :character Mode :character Median :1984 Median : 4.430e+06 Median : 43968 Median :3.386e+06 Median : 1.841
Mean :1984 Mean : 2.423e+08 Mean : 5310082 Mean :6.108e+07 Mean : 5.211
3rd Qu.:2002 3rd Qu.: 4.593e+07 3rd Qu.: 1018269 3rd Qu.:1.230e+07 3rd Qu.: 6.384
Max. :2019 Max. : 3.615e+10 Max. :1543508239 Max. :7.713e+09 Max. :403.351
NA's :3595 NA's :3595 NA's :914 NA's :4509
skim(df_merge) # TODO Mayber remove this ? seems to much
-- Data Summary ------------------------
Values
Name df_merge
Number of rows 138
Number of columns 6
_______________________
Column type frequency:
character 1
numeric 5
________________________
Group variables None
-- Variable type: character ------------------------------------------------------------------------------------------------------------------------------------------
# A tibble: 1 x 8
skim_variable n_missing complete_rate min max empty n_unique whitespace
* <chr> <int> <dbl> <int> <int> <int> <int> <int>
1 Entity 0 1 4 24 0 138 0
-- Variable type: numeric --------------------------------------------------------------------------------------------------------------------------------------------
# A tibble: 5 x 11
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
* <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 avg.ratio 0 1 4.87 5.71 3.12e-2 0.674 3.04 7.71 25.7 ▇▃▁▁▁
2 avg.Population 0 1 43601578. 146848485. 2.90e+5 4485573. 9761806. 28733782. 1295924678. ▇▁▁▁▁
3 avg.co2 0 1 190209421. 689048147. 2.54e+5 4375076. 20167978. 95996205. 5558492798. ▇▁▁▁▁
4 Life.Ladder 0 1 5.46 1.16 2.66e+0 4.59 5.59 6.26 7.79 ▂▆▇▇▅
5 sum.dif 0 1 102976353. 694705139. -6.75e+8 142380. 6399728. 28576960. 7786512016 ▇▁▁▁▁
In this plot the different countries in the data are shown in deeper colors for higher values of emission to population ratio in the year of 2017
d = df_all %>%
filter(Year==2017)
l <- list(color = toRGB("grey"), width = 0.2)
# specify map projection/options
g <- list(
showframe = FALSE,
showcoastlines = FALSE,
projection = list(type = 'Mercator')
)
p <- plot_geo(d) %>%
add_trace(
z = ~ratio, color = ~ratio, colors = 'Reds',
text = ~Entity, locations = ~Code, marker = list(line = l)
) %>%
colorbar(title = 'CO2/Population', thickness=15) %>%
layout(
title = 'World Ratio of CO2/Population',
geo = g,
autosize = F
)
p <- ggplotly(p, width = 3000, height = 1500, automargin = TRUE)
p
Yarden - As we can see, some of the leading countries are Qatar, United Arab Emirates, Kuwait, USA, Saudi Arabia, Canada, Kazakhstan and Australia with above 15 units of CO2/population. That means countries like China and India - whose CO2 emission rate is very high comparing to other countries - are ranked lower, even though they are more polluting, because the population in them is much larger.
For comfort of the view, each country is allocated to it’s fitting continent and a color.
fig <- d %>%
plot_ly(
x = ~co2,
y = ~Population,
size = ~ratio,
frame = ~Year,
text = ~Entity,
color = ~continent,
type = 'scatter',
mode = 'markers',
height = 500,
width = 900,
automargin = TRUE
)
fig <- fig %>% layout(
title = "Yearly CO2 Emissions by Countries vs Population",
xaxis = list(
type = "log"
# autosize = F
)
)
fig
We would like to develop this graph forward and see it with correlation to the life ladder index from 2005-2017. For this purpose we would like to create another df, containing both happiness index scores and data from “df_all”, which contains continents data as well now:
We would receive the following plot:
fig <- plot_ly(
data = df_2005_2017,
x = ~co2,
y = ~Life.Ladder,
size = ~ratio,
frame = ~Year,
text = ~Entity,
color = ~continent,
type = 'scatter',
mode = 'markers',
height = 500,
width = 900,
automargin = TRUE
)
fig <- fig %>% layout(
title = "Yearly CO2 Emissions by Countries vs Life Ladder",
xaxis = list(
type = "log"
)
)
fig
And yet another scatter plot for different countries:
df_3d <- df_2005_2017 %>%
filter(Year == 2017)
fig <- df_3d %>%
plot_ly(
x = ~Population,
y = ~ratio,
z = ~Life.Ladder,
# size = ~Life.Ladder,
# frame = ~Entity,
text = ~Entity,
color = ~continent,
height = 500,
width = 900,
automargin = TRUE
)
fig <- fig %>% layout(
title = "CO2 Emissions vs Life Ladder, Filtered by Countries, 2017"
)
fig <- fig %>% add_markers()
fig
Yarden - The graph shows the CO2 emissions of each country comparing to it’s population. If we look at the United States for example, it seems that while the population has increased, the level of pollution has decreased.
# d <- df_all[ , c("Entity", "Year", "Population")]
d <- df_all %>%
filter(Entity != "World") %>%
filter(co2 != 0)
d
pp <- streamgraph(d, key="Entity",
order = "asis",
value="co2",
date="Year",
offset="zero",
sort="co2"
) %>%
sg_axis_y(tick_format = "e") %>%
sg_legend(show=TRUE, label="Country: ") %>%
sg_title("CO2 Emissions by Years and Countries, 1950-2017")
pp
streamgraph_html returned an object of class `list` instead of a `shiny.tag`.streamgraph_html returned an object of class `list` instead of a `shiny.tag`.
Yarden - This data shows the accumulation of emissions year after year, from 1950 to 2017.
Yarden - This graph shows the happiness index for each country from 1950 to the present.
pp <- streamgraph(df_2005, key = "Entity",
# order = "reverse",
value="Life.Ladder",
date="year"
# offset="zero",
) %>%
sg_fill_brewer("Blues") %>%
sg_legend(show=TRUE, label="Country: ") %>%
sg_title("Happiness Index by Years and Countries, 1950-2017")
pp
streamgraph_html returned an object of class `list` instead of a `shiny.tag`.streamgraph_html returned an object of class `list` instead of a `shiny.tag`.
# g <- ggplot(data = df_2005, aes(x = year, y = Life.Ladder, group = Entity)) +
# geom_line() +
# geom_line(color="#69b3a2") +
# theme_ipsum()
#
# g <- g + labs(title = "Contries Hapinness by Years")
# g <- ggplotly(g)
# g
top_emiss_2017 <- df_emiss %>%
filter(Year == 2017) %>%
filter(Code != '') %>%
filter(Entity != "World") %>%
top_n( 10, co2) %>%
arrange(desc(co2)) %>%
head(10)
top_countries = top_emiss_2017$Entity
g <- ggplot(data = df_emiss[df_emiss$Entity %in% top_countries, ]
, aes(x = Year, y = co2, group = Entity)) +
geom_line() +
labs(title = "Top 10 Countries by CO2 Emissions") +
geom_line(aes(col = Entity)) +
theme_ipsum()
g <- ggplotly(g, height = 500,
width = 900,
automargin = TRUE)
g
g <- ggplot(data = df_all[df_all$Entity %in% top_countries, ]
, aes(x = Year, y = ratio, group = Entity)) +
geom_line(aes(col = Entity)) +
labs(title = "Top 10 Countries by CO2 Emissions to Population ratio") +
theme_ipsum()
g <- ggplotly(g, height = 500,
width = 900,
automargin = TRUE)
g
NA
Yarden - According to these two graphs, if we test the CO2 emission per person - in countries with a large population the ratio decreases. For example, Saudi Arabia has a low level of pollution, but when examining the pollution ratio per population, in 2017 Saudi Arabia had the highest emission per person ratio in the world.
d <- df_all[df_all$Entity %in% top_countries, ]
fig <- plot_ly(d,
x = ~Entity,
y = ~Diff,
type = "box",
color = ~Year,
height = 500,
width = 900,
automargin = TRUE)
fig <- fig %>% layout(
title = "Difference in Emission for Top 10 Polluting Countries",
boxmode = "group")
fig
NA
Yarden - This graph takes the 10 happiest countries and shows their level of emission between 1800 and 2017.
top_10_happiness <- df_merge %>%
filter(rank(desc(Life.Ladder))<=10)
g <- ggplot(data = df_emiss[df_emiss$Entity %in% top_10_happiness$Entity, ]
, aes(x = Year, y = co2, group = Entity)) +
geom_line() +
labs(title = "Top 10 happiness Countries by CO2 Emissions") +
theme_ipsum() +
geom_line(aes(col = Entity))
g <- ggplotly(g, height = 500,
width = 900,
automargin = TRUE)
g
In this section we are going to suggest two types of correlation based on our exploration of the data:
g <- ggplot(data = df_merge, aes(x = avg.ratio, y = Life.Ladder)) +
geom_point(aes(colour = Entity)) +
labs(title = "Average Ratio vs Happiness Index") +
theme_ipsum()
g <- ggplotly(g, height = 500,
width = 900,
automargin = TRUE)
g
In the following plot we have removed china and India because they are sort of “outliers” in a sense that they have displayed much larger grow in emissions compared to other countries, therefore making it very hard to display on the same plot.
df <- subset(df_merge, Entity != "China" & Entity != "India" )
g <- ggplot(data = df, aes(x = Life.Ladder , y = sum.dif)) +
geom_point(aes(colour = Entity)) +
labs(title = "Sum of Emissions Difference vs 2017 Happiness Index (China, India ex.)") +
theme_ipsum()
g <- ggplotly(g, height = 500,
width = 900,
automargin = TRUE)
g
# Part 4: Modeling
We are now going to introduce the linear model:
linearmod <- lm(Life.Ladder ~ avg.ratio, data=df_merge)
summary(linearmod)
Call:
lm(formula = Life.Ladder ~ avg.ratio, data = df_merge)
Residuals:
Min 1Q Median 3Q Max
-2.2447 -0.6262 -0.0959 0.6765 2.1682
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.88716 0.10612 46.055 < 2e-16 ***
avg.ratio 0.11818 0.01416 8.346 7.09e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9468 on 136 degrees of freedom
Multiple R-squared: 0.3387, Adjusted R-squared: 0.3338
F-statistic: 69.65 on 1 and 136 DF, p-value: 7.093e-14
confint(linearmod, level=0.95)
2.5 % 97.5 %
(Intercept) 4.67731147 5.0970157
avg.ratio 0.09017982 0.1461869
p <- plot_ly(
x = fitted(linearmod),
y = residuals(linearmod),
height = 500,
width = 900,
automargin = TRUE)
p<- p %>% layout(
title = "Residual plot for the Linear Model",
autosize = T,
yaxis = list(title = 'Residuals'),
xaxis = list(title = 'Fitted Linear Model')
)
p
g <- ggplot(df_merge, aes(x = avg.ratio, y = Life.Ladder)) +
geom_point(aes(colour = Entity)) +
theme_bw() +
stat_smooth(method = "lm") +
labs(title = "Linear Modelling Happiness Index and Average Ratio")
g <- ggplotly(g, height = 500,
width = 900,
automargin = TRUE)
`geom_smooth()` using formula 'y ~ x'
g
NA
NA
Yarden - According to the model we present here, it can be seen that the correlation between the level of happiness and the CO2 emission is not high, but it can be said that there is a connection. Some insights can be deduced from this: 1. The population of a counrty is not necessarily aware of the level of state pollution, so there are other factors that affect their level of happiness like welfare, health services, wealth etc. 2. The majority of the population is not bothered by the issue of pollution. Although there are people who do care, their percentage in the population is small and therefore their impact on the level of happiness within the population is low.
On the other hand:
In conclusion, in our opinion, the people may be aware of their country emissions, yet the issue is not one of their first problems and they feel distant. Because of that, their sense of commitment is weak and doesn’t affect their happiness.
Another thing is, that people started noticing the trend of the advanced countries in the world to stop polluting and start thinking more about the environment. Countries have begun to take significant steps to reduce pollution and preserve the environment.
numeric_tidy <- df_merge[-1]
corr_data <- cor(numeric_tidy)
g <- ggcorrplot(corr_data, hc.order = TRUE, type = "lower",
outline.col = "white") +
labs(title = "Correlation Matrix") +
theme_ipsum()
# colors = c("darkolivegreen2", "white", "darkolivegreen"))
ax <- list(
title = "",
showgrid = FALSE)
g <- ggplotly(g, height = 500,
width = 900,
automargin = TRUE) %>%
layout(xaxis=ax, yaxis=ax)
g
NA
NA
The following section presents another model and it’s summary:
gmod <- lm(Life.Ladder ~ log(avg.ratio), data=df_merge)
summary(gmod)
Call:
lm(formula = Life.Ladder ~ log(avg.ratio), data = df_merge)
Residuals:
Min 1Q Median 3Q Max
-2.03388 -0.55089 0.04812 0.59537 1.89653
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.15234 0.07550 68.25 <2e-16 ***
log(avg.ratio) 0.48724 0.04229 11.52 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8283 on 136 degrees of freedom
Multiple R-squared: 0.4939, Adjusted R-squared: 0.4902
F-statistic: 132.7 on 1 and 136 DF, p-value: < 2.2e-16
#plot x
p <- qplot(x = avg.ratio, Life.Ladder, data=df_merge)
p <- p + geom_smooth(method = "glm", formula = y~log(x), family = binomial(link = 'log')) +
geom_point(aes(colour=Entity)) +
theme_ipsum() +
labs(title = "Linear Regression Model - Average Ratio to Life Ladder Score")
p <- ggplotly(p, height = 500,
width = 900,
automargin = TRUE)
p
On this model, We can see a week correlation, but an existing one to the logistic regression. Interestingly enough, The “happiest” countries seem to have a higher ratio of CO2 emissions as well. Also worth mentioning from this plot: * The US is located far right on this map, meaning it’s emitting a lot of CO2 in proportion to it’s population, but it is pretty high on the happiness index as well. * Kuwait, being the leader of this unfortunate characteristic, also has a pretty decent Life Ladder score. *
#multiple regression
ex_list <- c("China", "India", "United States")
d <- df_merge[(!(df_merge$Entity %in% ex_list)),]
x <- d$avg.ratio + d$sum.dif + d$avg.co2
multi <- lm(Life.Ladder ~ x, data=d)
p <- qplot(x, Life.Ladder, data=d)
p <- p +
geom_smooth(method = "lm",
formula = y~x, family = binomial(link = 'log')) +
geom_point(aes(colour=Entity)) +
theme_ipsum() +
labs(title = "Linear Regression Model -Multiple Variables to Life Ladder Score")
Ignoring unknown parameters: family
fig <- ggplotly(p, height = 500,
width = 900,
automargin = TRUE)
fig
NA
Now let’s say we want to compare the differences between the top 10 happiness countries avg.ratio is higher then other countries. In this case, we have two unrelated (i.e., independent or unpaired) groups of samples. Therefore, it’s possible to use an independent t-test to evaluate whether the means are different.
mA - Weighted average of the 10 most happiest countries. mB - Weighted average of the rest of the countries.
Our research questions:
Is the weighted average of the 10 happiest countries (mA) greater than the weighted average of the other countries (mB)?
H0:mA≥mB - The null hypothesis
Ha:mA<mB (less) - The alternative hypothesis
# Create a data frame
T_data <- df_merge %>%
select(Entity, avg.ratio,) %>%
mutate(group = ifelse(Entity %in% top_10_happiness$Entity, "Top 10 Contries", "Other Countries"))
## TODO Yarden - Need to add to T_data a column of Life.Ladder
## TODO make a graph that shows the difference of Life Ladder and avg.ratio by groups "top_10" and "others".
g <- ggboxplot(T_data, x = "group", y = "avg.ratio",
color = "group", palette = c("#00AFBB", "#E7B800"),
ylab = "avg. ratio", xlab = "Group") +
theme_ipsum() +
labs(title = "Average Ratio Compared - Top 10 Countries vs Other Countries")
g <- ggplotly(g, height = 500,
width = 900,
automargin = TRUE)
g
NA
NA
#Do the two group have the same variances?
#We’ll use F-test to test for homogeneity in variances.
res.ftest <- var.test(avg.ratio ~ group, data = T_data)
res.ftest
F test to compare two variances
data: avg.ratio by group
F = 3.8362, num df = 127, denom df = 9, p-value = 0.03252
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
1.132105 8.499782
sample estimates:
ratio of variances
3.836239
The p-value of the F-test is p = 0.03252. It is lower than the significance level alpha = 0.05. In conclusion, there is a significant difference between the variances of the two data sets. Therefore, we used the T-test and assumed that the variances are not equal - according to case 1 of hypothesis testing.
# Compute t-test
res <- t.test(avg.ratio ~ group, data = T_data, var.equal = FALSE, alternative = "less")
res